Mapping Hindi-English Text Re-use Document Pairs

Authors

Parth Gupta

Khushboo Singhal

Abstract

An approach for finding the most probable English source document for a given suspicious Hindi document is presented. The approach does not use complex Machine Translation (MT) as a language-normalization pre-processing step; instead, it relies on standard cross-language resources available for the Hindi-English pair and calculates similarity using the Okapi BM25 model. We also present further improvements made to the system after analysis and discuss the challenges involved. The system was developed as part of the CLiTR competition and uses the CLiTR-Dataset for experimentation. The approach achieves a recall of 0.90, the highest reported on the dataset, and an F-measure of 0.79, the second highest.
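As a rough illustration of the pipeline the abstract describes, the sketch below maps Hindi terms to English through a bilingual dictionary and ranks candidate English documents with Okapi BM25. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not name the cross-language resources, so `TOY_DICT` and `translate_terms` are hypothetical stand-ins, and the IDF form and the parameters `k1=1.2`, `b=0.75` are common defaults rather than values taken from the paper.

```python
import math
from collections import Counter

# Hypothetical toy bilingual dictionary (assumption; the paper's actual
# Hindi-English resources are not specified in the abstract).
TOY_DICT = {"भारत": ["india"], "चुनाव": ["election"]}


def translate_terms(hindi_terms, dictionary):
    """Map Hindi terms to English candidates; unknown terms are dropped."""
    english = []
    for term in hindi_terms:
        english.extend(dictionary.get(term, []))
    return english


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per tokenized document in `docs`."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (assumption: one common formulation).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Candidate English source documents, already tokenized and lowercased.
english_docs = [
    ["india", "election", "result"],
    ["cricket", "match", "score"],
    ["india", "cricket"],
]
query = translate_terms(["भारत", "चुनाव"], TOY_DICT)
scores = bm25_scores(query, english_docs)
best = max(range(len(english_docs)), key=lambda i: scores[i])
```

Here `best` is the index of the English document ranked as the most probable source for the Hindi query. A dictionary lookup keeps the example simple; handling out-of-vocabulary words and named entities (for instance through transliteration) would need additional resources that this sketch does not model.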

Similar resources

Multiword Named Entities Extraction from Cross-Language Text Re-use

In practice, many named entities (NEs) are multiword. Most of the research, done on mining the NEs from the comparable corpora, is focused on the single word transliterated NEs. This work presents an approach to mine Multiword Named Entities (MWNEs) from the text re-use document pairs. Text re-use, at document level, can be seen as noisy parallel or comparable text based on the level of obfusca...

Discrimination of English to other Indian languages (Kannada and Hindi) for OCR system

India is a multilingual multi-script country. In every state of India there are two languages one is state local language and the other is English. For example in Andhra Pradesh, a state in India, the document may contain text words in English and Telugu script. For Optical Character Recognition (OCR) of such a bilingual document, it is necessary to identify the script before feeding the text w...

Microsoft Word - 19. OK_Revised [RegDone-3-4_305]_Mapping Parallel English _11-03_ CR-S-R

In this paper, we present a methodology for one to one (1:1) mapping of parallel English-Hindi parallel sentences. This methodology is based on the development of parallel English-Hindi word dictionary after syntactically and semantically analysis of the English-Hindi source text. We are using this methodology for the English and Hindi sentences, but the methodology can also be used for other l...

Supporting Large English-Hindi Parallel Corpus using Word Alignment

This paper gives description about methodology to understand parallel English-Hindi sentences using word alignment. This methodology is foundation to develop the parallel English-Hindi word dictionary after syntactically and semantically analysis of the English-Hindi source text. Methodology of proposed system is used for the English and Hindi sentences; also the methodology can be used for othe...

Script Independent Word Spotting in Multilingual Documents

This paper describes a method for script independent word spotting in multilingual handwritten and machine printed documents. The system accepts a query in the form of text from the user and returns a ranked list of word images from document image corpus based on similarity with the query word. The system is divided into two main components. The first component known as Indexer, performs indexi...

Publication date: 2011
